mel spectrogram
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
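To make the three-stage pipeline concrete, the following is a minimal sketch of how a fixed-dimensional speaker embedding can condition a mel-spectrogram synthesizer; the module names, layer choices, and dimensions are illustrative assumptions, not the paper's GE2E encoder, Tacotron 2, or WaveNet implementations.

```python
# Illustrative sketch of the three-stage pipeline (speaker encoder ->
# embedding-conditioned synthesizer -> vocoder input). All modules are toy
# stand-ins; the real system uses a GE2E-trained encoder, Tacotron 2,
# and a WaveNet vocoder.
import torch
import torch.nn as nn

class ToySpeakerEncoder(nn.Module):
    """Maps reference mel frames to an L2-normalized speaker embedding."""
    def __init__(self, n_mels=40, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, mels):                     # (B, T, n_mels)
        _, (h, _) = self.lstm(mels)
        emb = h[-1]
        return emb / emb.norm(dim=-1, keepdim=True)

class ToySynthesizer(nn.Module):
    """Predicts mel frames from text features concatenated with the embedding."""
    def __init__(self, text_dim=128, emb_dim=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + emb_dim, n_mels)

    def forward(self, text_feats, spk_emb):      # (B, T, text_dim), (B, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        return self.proj(torch.cat([text_feats, cond], dim=-1))

encoder, synth = ToySpeakerEncoder(), ToySynthesizer()
ref_mels = torch.randn(1, 120, 40)               # a few seconds of reference audio
text_feats = torch.randn(1, 200, 128)            # encoded input text
mel_out = synth(text_feats, encoder(ref_mels))   # (1, 200, 80) mel spectrogram
```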
GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis
Baoueb, Teysir, Bie, Xiaoyu, Fontaine, Mathieu, Richard, Gaël
Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness in vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension to the WaveGrad vocoder that integrated the Griffin-Lim algorithm (GLA) into the reverse process to reduce inconsistencies between the generated signal and the conditioning mel spectrogram. In this paper, we further improve GLA-Grad through an innovative choice in how to apply the correction. In particular, we compute the correction term only once, with a single application of GLA, to accelerate the generation process. Experimental results demonstrate that our method consistently outperforms the baseline models, particularly in out-of-domain scenarios.
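As a rough illustration of the kind of Griffin-Lim consistency correction discussed above (not the GLA-Grad++ update itself), a single GLA-style projection keeps a candidate waveform's phase, imposes the target magnitude spectrogram, and resynthesizes; all parameter values below are assumptions.

```python
# Minimal sketch of a single Griffin-Lim style consistency step:
# keep the candidate signal's phase, impose the target magnitude,
# and resynthesize. Parameter values are illustrative.
import numpy as np
import librosa

def single_gla_correction(candidate, target_mag, n_fft=1024, hop_length=256):
    """One magnitude-projection step toward the conditioning spectrogram."""
    spec = librosa.stft(candidate, n_fft=n_fft, hop_length=hop_length)
    phase = np.exp(1j * np.angle(spec))
    frames = min(spec.shape[1], target_mag.shape[1])   # align frame counts
    corrected = target_mag[:, :frames] * phase[:, :frames]
    return librosa.istft(corrected, hop_length=hop_length, length=len(candidate))

# Example: a noisy candidate corrected toward the magnitude of a clean target.
sr = 22050
target = librosa.tone(440, sr=sr, duration=1.0)
target_mag = np.abs(librosa.stft(target, n_fft=1024, hop_length=256))
candidate = target + 0.1 * np.random.randn(len(target))
refined = single_gla_correction(candidate, target_mag)
```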
HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios
Bai, Bingsong, Geng, Yizhong, Wang, Fengping, Wang, Cong, Guo, Puyuan, Gao, Yingming, Li, Ya
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information, which degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
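The pitch and volume modeling mentioned above can be illustrated with standard signal-processing calls; the concrete feature choices below (pYIN F0 and frame-level RMS via librosa) are assumptions for the sketch, not HQ-SVC's actual front end.

```python
# Sketch: extract frame-level pitch (F0) and volume (RMS) trajectories,
# the kind of acoustic conditioning signals discussed above.
import numpy as np
import librosa

sr = 16000
y = librosa.chirp(fmin=200, fmax=400, sr=sr, duration=2.0)   # stand-in vocal signal
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'),
    sr=sr, frame_length=2048, hop_length=256)
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=256)[0]

f0 = np.nan_to_num(f0)          # unvoiced frames -> 0 so both tracks align
print(f0.shape, rms.shape)      # same number of frames
```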
H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings
Kumar, Rohith Shinoj, Dinda, Rushdeep, Tyagi, Aditya, B., Annappa, R, Naveen Kumar M.
Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
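For orientation, a bare-bones CNN-LSTM stack of the kind referenced above might look like the following sketch; the layer sizes are placeholders, and the paper's H-Infinity-inspired trainable filter parameters are not reproduced here.

```python
# Illustrative CNN-LSTM classifier for 1-D heart sound segments
# (binary: normal vs. arrhythmic). Hyperparameters are placeholders.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):                            # x: (B, 1, samples)
        feats = self.cnn(x)                          # (B, 32, T)
        out, _ = self.lstm(feats.transpose(1, 2))    # (B, T, 64)
        return self.fc(out[:, -1])                   # logits from last time step

model = CNNLSTM()
logits = model(torch.randn(4, 1, 4000))              # 4 one-second segments at 4 kHz
```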
LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation
Chen, Jun, Hu, Shichao, Lin, Jiuxin, Li, Wenjie, Zhang, Zihan, Li, Xingchen, Liu, JinJiang, Xiao, Longshuai, Weng, Chao, Xie, Lei, Wu, Zhiyong
In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
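The spectral-plus-spatial feature combination described above (mel spectrogram plus interaural phase difference) can be sketched as follows; the microphone simulation and all parameters are assumptions, not LSZone's exact SpaIEC module.

```python
# Sketch: combine a mel spectrogram (spectral cue) with the inter-channel
# phase difference (spatial cue) for a two-microphone signal.
import numpy as np
import librosa

sr, n_fft, hop = 16000, 512, 128
mic1 = librosa.chirp(fmin=200, fmax=2000, sr=sr, duration=1.0)
mic2 = np.roll(mic1, 3)                       # crude inter-mic delay

stft1 = librosa.stft(mic1, n_fft=n_fft, hop_length=hop)
stft2 = librosa.stft(mic2, n_fft=n_fft, hop_length=hop)
ipd = np.angle(stft1) - np.angle(stft2)       # inter-channel phase difference

mel = librosa.feature.melspectrogram(S=np.abs(stft1) ** 2, sr=sr, n_mels=64)
features = np.concatenate([librosa.power_to_db(mel), ipd], axis=0)
print(features.shape)                          # (64 + n_fft//2 + 1, frames)
```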
Amplifying Emotional Signals: Data-Efficient Deep Learning for Robust Speech Emotion Recognition
Speech Emotion Recognition (SER) presents a significant yet persistent challenge in human-computer interaction. While deep learning has advanced spoken language processing, achieving high performance on limited datasets remains a critical hurdle. This paper confronts this issue by developing and evaluating a suite of machine learning models, including Support Vector Machines (SVMs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs), for automated emotion classification in human speech. We demonstrate that by strategically employing transfer learning and innovative data augmentation techniques, our models can achieve impressive performance despite the constraints of a relatively small dataset. Our most effective model, a ResNet34 architecture, establishes a new performance benchmark on the combined RAVDESS and SAVEE datasets, attaining an accuracy of 66.7% and an F1 score of 0.631. These results underscore the substantial benefits of leveraging pre-trained models and data augmentation to overcome data scarcity, thereby paving the way for more robust and generalizable SER systems.
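As an illustration of the waveform-level data augmentation strategy mentioned above (not necessarily the paper's exact recipe), simple perturbations such as additive noise, time stretching, and pitch shifting can multiply a small emotional speech dataset:

```python
# Sketch of simple waveform-level augmentations for small SER datasets.
# The perturbation magnitudes are illustrative assumptions.
import numpy as np
import librosa

def augment(y, sr):
    """Return a few perturbed copies of a training utterance."""
    noisy = y + 0.005 * np.random.randn(len(y))
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [noisy, stretched, shifted]

y = librosa.chirp(fmin=100, fmax=300, sr=16000, duration=1.0)   # stand-in utterance
variants = augment(y, sr=16000)
```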
ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition
Abouzeid, Ali, Elbouardi, Bilal, Maged, Mohamed, Shehata, Shady
Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.
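A compact sketch of the 2D-CNN, BiLSTM, and attention pattern described above follows; the dimensions and the attention form are illustrative assumptions rather than ArabEmoNet's published configuration.

```python
# Illustrative 2D CNN -> BiLSTM -> attention classifier over mel spectrograms.
import torch
import torch.nn as nn

class CnnBiLstmAttn(nn.Module):
    def __init__(self, n_mels=64, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.bilstm = nn.LSTM(32 * (n_mels // 4), 64,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, mel):                        # (B, 1, n_mels, T)
        z = self.conv(mel)                         # (B, 32, n_mels/4, T/4)
        b, c, f, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.bilstm(z)                      # (B, T/4, 128)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over time
        return self.fc((w * h).sum(dim=1))

model = CnnBiLstmAttn()
logits = model(torch.randn(2, 1, 64, 200))          # 2 mel spectrograms
```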
A Appendix
A.1 Self-supervised loss formula
Wav2vec 2.0, when trained in a self-supervised way, uses a loss L that is the weighted combination of two losses: a contrastive loss L_m and a diversity loss L_d, combined as L = L_m + α L_d.

Then, we use the nistats [Abraham et al., 2014] compute_regressor function with the 'glover' model to temporally convolve the regressors with the hemodynamic response function.

To address this issue, [Pasad et al., 2021] explored the encoding of local acoustic features, phone identity, word identity and word meaning across layers. Similarly, [Millet et al., 2021] compared representations to human behavioural data to assess whether they better captured listeners' perception of higher-level phonemic properties or of lower-level subphonemic properties of speech stimuli. Finally, a recent study [Vaidya et al., 2022] explores filter banks, spectrograms, phonemes and words across layers. Here, we complement these analyses by showing that self-supervised learning allows wav2vec 2.0 to learn, along its hierarchy, representations of mel spectrograms, phonetic categories and word embeddings (Figure S1). We study the following features: the mel spectrogram of the audio, computed using librosa (d=128), and the phonemes (categorical features).
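For reference, a 128-band mel spectrogram feature (d=128) of the kind mentioned above can be computed with librosa roughly as follows; the window and hop sizes are assumptions, since the appendix does not specify them.

```python
# Sketch: 128-band mel spectrogram feature (d=128) with librosa.
# Frame and hop lengths are illustrative assumptions.
import librosa

y = librosa.chirp(fmin=100, fmax=4000, sr=16000, duration=2.0)   # stand-in audio
mel = librosa.feature.melspectrogram(y=y, sr=16000, n_fft=1024,
                                     hop_length=160, n_mels=128)
log_mel = librosa.power_to_db(mel)      # shape: (128, frames)
```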